Lessons Learned in Automatically Detecting Lists in OCRed Historical Documents
Authors
Abstract
Lists are often the most data-rich parts of a document collection, but they are usually not set apart explicitly from the rest of the text, especially in a corpus of historical OCRed documents. There are many kinds of lists, differing from each other in both layout and content, and writing individualized code to process every possible type of list is expensive. In the present research, we focus on general list detection, the first step in a larger process of general list reading. Our system, ListDetector, automatically locates lists using noisy word labels obtained from cheaply developed dictionaries and regular expressions. In this paper, we start by describing a simple baseline system, the first system we are aware of to address general list detection in plain text or OCRed documents. From there, we present several of the challenges and corresponding solutions that we discovered as we raised the F-measure of ListDetector from 79% to 86.3%. We compute evaluation metrics against a gold standard corpus of OCRed documents in the family history domain that we have manually annotated for the tasks of list detection and structure recognition. We will continue adding to this corpus and make it publicly available for other researchers to use.
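To make the idea of detecting lists from noisy word labels concrete, here is a minimal sketch. It is not ListDetector itself: the dictionary `GIVEN_NAMES`, the regex table `PATTERNS`, the label names, and the rule "a list is a run of at least three consecutive lines with the same label sequence" are all illustrative assumptions, chosen only to show how cheap dictionaries and regular expressions could feed a simple detector.

```python
import re

# Hypothetical, tiny dictionary; a real system would use much larger ones.
GIVEN_NAMES = {"john", "mary", "william", "sarah"}

# Hypothetical regex labelers, tried in order.
PATTERNS = [
    ("YEAR", re.compile(r"^1[5-9]\d\d$")),
    ("NUM", re.compile(r"^\d+$")),
    ("CAP", re.compile(r"^[A-Z][a-z]+$")),
]


def label_word(word):
    """Return one coarse label for a word (dictionary first, then regexes)."""
    core = word.strip(".,;:()")
    if core.lower() in GIVEN_NAMES:
        return "NAME"
    for label, pattern in PATTERNS:
        if pattern.match(core):
            return label
    return "OTHER"


def line_signature(line):
    """Label sequence of a line, used as a rough description of its shape."""
    return tuple(label_word(w) for w in line.split())


def detect_lists(lines, min_items=3):
    """Return (start, end) pairs, end exclusive, for runs of at least
    min_items consecutive non-empty lines that share one signature."""
    spans = []
    start = 0
    while start < len(lines):
        sig = line_signature(lines[start])
        end = start + 1
        # Extend the run while the next line has the same non-empty signature.
        while end < len(lines) and sig and line_signature(lines[end]) == sig:
            end += 1
        if sig and end - start >= min_items:
            spans.append((start, end))
        start = end
    return spans


page = [
    "The family moved west.",
    "John Smith 1820",
    "Mary Jones 1834",
    "William Brown 1851",
    "They settled near the river.",
]
print(detect_lists(page))
```

Exact signature matching is brittle: a single OCR error in one word changes that line's signature and splits the run. A more tolerant detector would compare signatures approximately, which is the kind of challenge the abstract alludes to.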
Similar resources
Populating Ontologies with Data from OCRed Lists
A flexible, accurate, and efficient method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine searchable, queryable, and linkable and expose their rich ontological interrelationships. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selectio...
Populating Ontologies by Semi-automatically Inducing Information Extraction Wrappers for Lists in OCRed Documents
A flexible, accurate, and efficient method of extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine queryable, linkable, and editable. But, to work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selection of human guidance. We propose a wrapper-induction solution for...
Scalable Recognition, Extraction, and Structuring of Data from Lists in OCRed Text using Unsupervised Active Wrapper Induction
A process for accurately and automatically extracting asserted facts from lists in OCRed documents and inserting them into an ontology would contribute to making a variety of historical documents machine searchable, queryable, and linkable. To work well, such a process should be adaptable to variations in document and list format, tolerant of OCR errors, and careful in its selection of human gu...
Extracting and Organizing Facts of Interest from OCRed Historical Documents
Historical documents contain facts that family history enthusiasts are interested in extracting. In addition to fact extraction, organizing these facts into disambiguated entity records is also of interest. This paper shows how facts from an excerpt of a page in an OCRed book can be gathered automatically with some expert knowledge.
Populating Ontologies with Data from Lists in Family History Books
A flexible, accurate, and cost-effective method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine searchable, queryable, and linkable and expose their rich ontological interrelationships. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its sel...